Enrich DurabilityAgent.CheckHealthAsync with persistence-layer signals (#2646) #2649

Merged: jeremydmiller merged 1 commit into main from 2646-enrich-durability-health on May 1, 2026

@jeremydmiller
Member

Closes #2646.

Summary

The three durability agents — Wolverine.RDBMS.DurabilityAgent, RavenDbDurabilityAgent, CosmosDbDurabilityAgent — all relied on the default IAgent.CheckHealthAsync: Status == Running ? Healthy : Unhealthy. That hides the cases monitoring tools (CritterWatch's Agents tab) actually need to flag: a healthy-looking agent silently failing to reach the store, the dead-letter queue ballooning because handlers are dying, or a recovery loop that's not draining a stuck batch.

This PR threads three new persistence signals through every durability agent:

  1. Persistence reachability — each agent's poll loop wraps its tick in try/catch and feeds the outcome into a per-agent DurabilityHealthSignals instance. CheckHealthAsync also pings the store via FetchCountsAsync. One failed cycle ⇒ Degraded with the underlying error message; N consecutive failures (default 3, DurabilitySettings.HealthConsecutiveFailureUnhealthyThreshold) ⇒ Unhealthy.

  2. Dead-letter queue growth — between consecutive evaluations, compare the PersistedCounts.DeadLetter delta against DurabilitySettings.HealthDeadLetterGrowthPerMinuteThreshold (default 100/min). Above threshold ⇒ Degraded with the rate in the description.

  3. Stuck recovery / scheduled-job pollers — if persisted inbox+outbox (or scheduled) counts stay non-zero and never decrease across DurabilitySettings.HealthStuckPollCycleThreshold consecutive evaluations (default 3) ⇒ Degraded. Catches the "single bad envelope blocks the queue" case the issue calls out.
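The two count-based signals above (dead-letter growth and a stuck recovery backlog) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `CountsSnapshot` and `CountSignalsSketch` are hypothetical stand-ins for `PersistedCounts` and `DurabilityHealthSignals`, and counting the first non-zero evaluation as cycle 1 is a guess at the intended semantics. The scheduled-job check would presumably follow the same pattern with its own counter.

```csharp
using System;
using System.Collections.Generic;

// Usage: three evaluations one minute apart with a backlog that never drains.
var signals = new CountSignalsSketch(dlqPerMinuteThreshold: 100, stuckCycleThreshold: 3);
var start = DateTimeOffset.UtcNow;
for (var i = 0; i < 3; i++)
{
    var counts = new CountsSnapshot(Incoming: 5, Outgoing: 0, DeadLetter: i * 250);
    foreach (var reason in signals.Evaluate(counts, start.AddMinutes(i)))
        Console.WriteLine($"cycle {i}: {reason}");
}

// Hypothetical stand-in for the persisted counts snapshot.
public readonly record struct CountsSnapshot(int Incoming, int Outgoing, int DeadLetter);

// Hypothetical evaluator for the two count-based Degraded signals.
public class CountSignalsSketch
{
    private readonly double _dlqPerMinuteThreshold;
    private readonly int _stuckCycleThreshold;
    private CountsSnapshot? _previous;
    private DateTimeOffset _previousAt;
    private int _stuckCycles;

    public CountSignalsSketch(double dlqPerMinuteThreshold, int stuckCycleThreshold)
    {
        _dlqPerMinuteThreshold = dlqPerMinuteThreshold;
        _stuckCycleThreshold = stuckCycleThreshold;
    }

    // Returns one description per Degraded signal; an empty list means no count-based problem.
    public List<string> Evaluate(CountsSnapshot current, DateTimeOffset now)
    {
        var reasons = new List<string>();
        var previousPending = 0;

        if (_previous is { } prev)
        {
            // Dead-letter growth: delta since the last evaluation, normalized to per-minute.
            var minutes = (now - _previousAt).TotalMinutes;
            var growth = current.DeadLetter - prev.DeadLetter;
            if (minutes > 0 && growth / minutes > _dlqPerMinuteThreshold)
                reasons.Add($"dead-letter queue growing at {growth / minutes:F0}/min");

            previousPending = prev.Incoming + prev.Outgoing;
        }

        // Stuck recovery: the backlog is non-zero and did not shrink since the last evaluation.
        var pending = current.Incoming + current.Outgoing;
        _stuckCycles = pending > 0 && pending >= previousPending ? _stuckCycles + 1 : 0;
        if (_stuckCycles >= _stuckCycleThreshold)
            reasons.Add($"recovery backlog of {pending} has not drained for {_stuckCycles} cycles");

        _previous = current;
        _previousAt = now;
        return reasons;
    }
}
```

Passing `now` in explicitly, rather than reading the clock inside `Evaluate`, keeps the rate calculation deterministic in tests.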

Status precedence: a non-Running status always returns Unhealthy first; then the consecutive-failure Unhealthy; then the worst aggregated Degraded. Multiple Degraded signals are joined into a single ;-separated description so operators see the full picture in one tooltip.
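As a rough illustration of that precedence, the sketch below resolves a final status from the inputs described above. `SketchStatus` and `HealthPrecedenceSketch` are hypothetical names, the description strings are invented for the example, and the real agents presumably return whatever result type `IAgent.CheckHealthAsync` declares.

```csharp
using System;
using System.Collections.Generic;

// Usage: one failed poll plus a dead-letter warning resolves to a single Degraded result.
var result = HealthPrecedenceSketch.Resolve(
    isRunning: true,
    consecutiveFailures: 1,
    unhealthyThreshold: 3,
    lastError: "connection refused",
    degradedReasons: new[] { "dead-letter queue growing at 250/min" });
Console.WriteLine($"{result.Status}: {result.Description}");
// prints "Degraded: persistence poll failed: connection refused; dead-letter queue growing at 250/min"

// Hypothetical status enum, used only to keep the sketch self-contained.
public enum SketchStatus { Healthy, Degraded, Unhealthy }

public static class HealthPrecedenceSketch
{
    public static (SketchStatus Status, string Description) Resolve(
        bool isRunning,
        int consecutiveFailures,
        int unhealthyThreshold,
        string lastError,
        IReadOnlyList<string> degradedReasons)
    {
        // 1. A non-Running agent is Unhealthy regardless of any other signal.
        if (!isRunning)
            return (SketchStatus.Unhealthy, "agent is not running");

        // 2. Enough consecutive persistence failures escalate to Unhealthy.
        if (consecutiveFailures >= unhealthyThreshold)
            return (SketchStatus.Unhealthy,
                $"{consecutiveFailures} consecutive persistence failures: {lastError}");

        // 3. Everything else is at worst Degraded; join all reasons into one description.
        var reasons = new List<string>();
        if (consecutiveFailures > 0)
            reasons.Add($"persistence poll failed: {lastError}");
        reasons.AddRange(degradedReasons);

        return reasons.Count > 0
            ? (SketchStatus.Degraded, string.Join("; ", reasons))
            : (SketchStatus.Healthy, "ok");
    }
}
```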

DurabilityHealthSignals is intentionally public so per-store agents from the RavenDb / CosmosDb assemblies (which do not have InternalsVisibleTo into Wolverine) can use it directly. The class is deliberately small: shared mutable state, RecordPollSuccess / RecordPollFailure mutators, and a single Evaluate() that takes the current PersistedCounts snapshot.
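A minimal sketch of how a poll loop might feed such a class is shown below. The mutator names `RecordPollSuccess` and `RecordPollFailure` come from the description above, but their signatures, the `Snapshot()` accessor, and the `PollOutcomeTrackerSketch` and `PollLoopSketch` types are assumptions made for illustration.

```csharp
using System;
using System.Threading.Tasks;

// Usage: one failing tick followed by one successful tick.
var tracker = new PollOutcomeTrackerSketch();
await PollLoopSketch.RunTickAsync(
    () => Task.FromException(new InvalidOperationException("store unreachable")), tracker);
Console.WriteLine(tracker.Snapshot());
await PollLoopSketch.RunTickAsync(() => Task.CompletedTask, tracker);
Console.WriteLine(tracker.Snapshot());

// Hypothetical holder for the shared mutable poll-outcome state.
public class PollOutcomeTrackerSketch
{
    private readonly object _gate = new object();
    private int _consecutiveFailures;
    private string _lastError = "";

    // A successful poll clears the failure streak.
    public void RecordPollSuccess()
    {
        lock (_gate)
        {
            _consecutiveFailures = 0;
            _lastError = "";
        }
    }

    // A failed poll extends the streak and keeps the latest error message.
    public void RecordPollFailure(Exception exception)
    {
        lock (_gate)
        {
            _consecutiveFailures++;
            _lastError = exception.Message;
        }
    }

    public (int ConsecutiveFailures, string LastError) Snapshot()
    {
        lock (_gate)
        {
            return (_consecutiveFailures, _lastError);
        }
    }
}

public static class PollLoopSketch
{
    // Wraps a single poll tick so its outcome is recorded and a failure does not end the loop.
    public static async Task RunTickAsync(Func<Task> tick, PollOutcomeTrackerSketch signals)
    {
        try
        {
            await tick();
            signals.RecordPollSuccess();
        }
        catch (Exception e)
        {
            signals.RecordPollFailure(e);
        }
    }
}
```

The lock is there because the poll loop writes this state while a health check may read it from another thread. A real loop would probably also treat cancellation separately rather than counting it as a persistence failure.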

Files

  • src/Wolverine/Persistence/Durability/DurabilityHealthSignals.cs (new) — the shared evaluator.
  • src/Wolverine/DurabilitySettings.cs — three new threshold properties (defaults: 100/min DLQ growth, 3 stuck cycles, 3 consecutive failures).
  • src/Persistence/Wolverine.RDBMS/DurabilityAgent.cs — replaces the existing _successCount / _exceptionCount rolling logic with the shared signals; adds count-based signals.
  • src/Persistence/Wolverine.RavenDb/Internals/Durability/RavenDbDurabilityAgent.cs — adds CheckHealthAsync override; wraps each recovery + scheduled-job tick in try/catch to feed the signals.
  • src/Persistence/Wolverine.CosmosDb/Internals/Durability/CosmosDbDurabilityAgent.cs — same shape as RavenDb.
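For orientation, overriding the three thresholds might look like the configuration fragment below. The property names are taken from this description; their exact types, and the assumption that they are set through `opts.Durability` like other `DurabilitySettings` values, have not been checked against the merged code. The fragment needs the Wolverine NuGet package to compile.

```csharp
using Microsoft.Extensions.Hosting;
using Wolverine;

var builder = Host.CreateDefaultBuilder(args);

builder.UseWolverine(opts =>
{
    // Values here are arbitrary examples; the defaults per this PR are 100/min, 3, and 3.
    opts.Durability.HealthDeadLetterGrowthPerMinuteThreshold = 250;
    opts.Durability.HealthStuckPollCycleThreshold = 5;
    opts.Durability.HealthConsecutiveFailureUnhealthyThreshold = 4;
});

using var host = builder.Build();
await host.RunAsync();
```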

Test plan

  • New CoreTests/Persistence/durability_health_signals_tests covers the helper in isolation: status precedence, single-failure Degraded, threshold-based Unhealthy, DLQ growth above + below threshold, stuck-recovery + stuck-scheduled with reset behaviour, multi-signal aggregation, and the diagnostic counter accessor. 12/12 green.
  • Full CoreTests suite green: Failed: 0, Passed: 1421, Total: 1421, Duration: 3m 53s.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jeremydmiller merged commit 64b31ea into main on May 1, 2026
19 of 21 checks passed